Interactive Learning of Node Selecting Tree Transducers⋆
Authors
Julien Carme et al.
Abstract
We develop new algorithms for learning monadic node selection queries in unranked trees from annotated examples, and apply them to visually interactive Web information extraction. We propose to represent monadic queries by bottom-up deterministic Node Selecting Tree Transducers (NSTTs), a particular class of tree automata that we introduce. We prove that deterministic NSTTs capture the class of queries definable in monadic second-order logic (MSO) in trees, which Gottlob and Koch (2002) argue to have the right expressiveness for Web information extraction, and prove that monadic queries defined by NSTTs can be answered efficiently. We present a new polynomial-time algorithm in the style of RPNI that learns monadic queries defined by deterministic NSTTs from completely annotated examples, in which all selected nodes are distinguished. In practice, users prefer to provide partial annotations. We propose to account for partial annotations by intelligent tree pruning heuristics. We introduce pruning NSTTs, a formalism that shares many advantages of NSTTs. This leads us to an interactive learning algorithm for monadic queries defined by pruning NSTTs, which satisfies a new formal active learning model in the style of Angluin (1987). We have implemented our interactive learning algorithm and integrated it into a visually interactive Web information extraction system, called Squirrel, by plugging it into the Mozilla Web browser. Experiments on realistic Web documents confirm excellent quality with very few user interactions during wrapper induction.
⋆ A previous version of this article was published in Machine Learning 66,1 (2007) 33–67.
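To make the idea of answering a monadic query with a tree automaton over annotated trees more concrete, the following is a minimal sketch and not the authors' implementation. It assumes ranked trees with a fixed number of children per label (the abstract concerns unranked trees), and it assumes that the automaton reads pairs of a label and a Boolean annotation and accepts at most one annotation per tree. The names `Tree`, `answer_query`, `delta` and `final`, as well as the toy query "select every `a`-leaf that is the left child of an `f`-node", are hypothetical and chosen only for illustration.

```python
from itertools import product


class Tree:
    """A node with a label and an ordered tuple of child trees."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = tuple(children)


def answer_query(tree, delta, final):
    """Return the set of selected nodes, each given as a path of child indices.

    delta maps (label, selected, child_states) to a state (assumed format);
    final is the set of accepting states.
    """
    reach = {}  # path -> {state: [(selected, child_states), ...]}

    def up(node, path):
        # Bottom-up pass: collect every state reachable under some annotation.
        child_sets = [up(c, path + (i,)) for i, c in enumerate(node.children)]
        table = {}
        for combo in product(*child_sets):
            for selected in (False, True):
                state = delta.get((node.label, selected, combo))
                if state is not None:
                    table.setdefault(state, []).append((selected, combo))
        reach[path] = table
        return set(table)

    def down(node, path, state, result):
        # Top-down pass: follow only the choices that lead to acceptance.
        for selected, combo in reach[path][state]:
            if selected:
                result.add(path)
            for i, (child, child_state) in enumerate(zip(node.children, combo)):
                down(child, path + (i,), child_state, result)

    root_states = up(tree, ())
    result = set()
    for state in final & root_states:
        down(tree, (), state, result)
    return result


# Toy automaton (hypothetical): an "a"-leaf must be annotated as selected
# exactly when it is the left child of an "f"-node.
delta = {("a", True, ()): "A1", ("a", False, ()): "A0", ("b", False, ()): "o"}
for left in ("A1", "o"):
    for right in ("A0", "o"):
        delta[("f", False, (left, right))] = "o"
final = {"o", "A0"}

# The tree f(a, f(a, a)): the a-leaves at paths (0,) and (1, 0) are left children.
tree = Tree("f", [Tree("a"), Tree("f", [Tree("a"), Tree("a")])])
print(sorted(answer_query(tree, delta, final)))
```

The sketch enumerates all combinations of child states, which is adequate for small fixed arities; it says nothing about how the efficient query answering claimed in the abstract is actually achieved.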
Similar resources
Learning n-Ary Node Selecting Tree Transducers from Completely Annotated Examples
We present the first algorithm for learning n-ary node selection queries in trees from completely annotated examples by methods of grammatical inference. We propose to represent n-ary queries by deterministic n-ary node selecting tree transducers (n-NSTTs). These are tree automata that capture the class of monadic second-order definable n-ary queries. We show that n-NSTT defined polynomially bou...
Schema-Guided Induction of Monadic Queries
The induction of monadic node selecting queries from partially annotated XML-trees is a key task in Web information extraction. We show how to integrate schema guidance into an RPNI-based learning algorithm, in which monadic queries are represented by pruning node selecting tree transducers. We present experimental results on schema guidance by the DTD of HTML.
Learning Monadic Queries for Semi-Structured Documents from Positive Examples
Querying for nodes in trees is a core operation for information extraction from semi-structured documents in XML or HTML. We show that regular monadic queries for nodes in trees can be identified from positive examples, and that this can be done in polynomial time when the queries are represented by deterministic node selecting transducers, which we introduce.
Learning Node Selecting Tree Transducer from Completely Annotated Examples
A basic problem in Web information extraction is to find appropriate queries for informative nodes in trees. We propose to learn queries for nodes in trees automatically from examples. We introduce node selecting tree transducers (NSTTs) and show how to induce deterministic NSTTs in polynomial time from completely annotated examples. We have implemented learning algorithms for NSTTs, started apply...
Decision Problems of Tree Transducers with Origin
A tree transducer with origin translates an input tree into a pair of output tree and origin info. The origin info maps each node in the output tree to the unique input node that created it. In this way, the implementation of the transducer becomes part of its semantics. We show that the landscape of decidable properties changes drastically when origin info is added. For instance, equivalence o...